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Abstract 

Malicious  code  detection  is  a  crucial  component  of  any  defense  mechanism.  In  this  paper,  we  present  a  unique  view¬ 
point  on  malicious  code  detection.  We  regard  malicious  code  detection  as  an  obfuscation-deobfuscation  game  between 
malicious  code  writers  and  researchers  working  on  malicious  code  detection.  Malicious  code  writers  attempt  to  obfus¬ 
cate  the  malicious  code  to  subvert  the  malicious  code  detectors,  such  as  anti-virus  software.  We  tested  the  resilience  of 
three  commercial  virus  scanners  against  code-obfuscation  attacks.  The  results  were  surprising:  the  three  commercial 
virus  scanners  could  be  subverted  by  very  simple  obfuscation  transformations!  We  present  an  architecture  for  detect¬ 
ing  malicious  patterns  in  executables  that  is  resilient  to  common  obfuscation  transformations.  Experimental  results 
demonstrate  the  efficacy  of  our  prototype  tool,  SAFE  (a  static  analyzer  for  executables). 


1  Introduction 

In  the  interconnected  world  of  computers,  malicious 
code  has  become  an  omnipresent  and  dangerous  threat. 
Malicious  code  can  infiltrate  hosts  using  a  variety  of 
methods  such  as  attacks  against  known  software  flaws, 
hidden  functionality  in  regular  programs,  and  social  en¬ 
gineering.  Given  the  devastating  effect  malicious  code 
has  on  our  cyber  infrastructure,  identifying  malicious 
programs  is  an  important  goal.  Detecting  the  presence 
of  malicious  code  on  a  given  host  is  a  crucial  component 
of  any  defense  mechanism. 

Malicious  code  is  usually  classified  [30]  according  to 
its  propagation  method  and  goal  into  the  following  cate¬ 
gories: 

•  viruses  are  programs  that  self-replicate  within  a  host 
by  attaching  themselves  to  programs  and/or  documents 
that  become  carriers  of  the  malicious  code; 

•  worms  self-replicate  across  a  network; 

•  trojan  horses  masquerade  as  useful  programs,  but  con¬ 
tain  malicious  code  to  attack  the  system  or  leak  data; 

•  back  doors  open  the  system  to  external  entities  by  sub- 
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verting  the  local  security  policies  to  allow  remote  access 
and  control  over  a  network; 

•  spyware  is  a  useful  software  package  that  also  trans¬ 
mits  private  user  data  to  an  external  entity. 

Combining  two  or  more  of  these  malicious  code  cate¬ 
gories  can  lead  to  powerful  attack  tools.  For  example,  a 
worm  can  contain  a  payload  that  installs  a  back  door  to 
allow  remote  access.  When  the  worm  replicates  to  a  new 
system  (via  email  or  other  means),  the  back  door  is  in¬ 
stalled  on  that  system,  thus  providing  an  attacker  with  a 
quick  and  easy  way  to  gain  access  to  a  large  set  of  hosts. 
Stamford  el  al.  have  demonstrated  that  worms  can  prop¬ 
agate  extremely  quickly  through  a  network,  and  thus 
potentially  cripple  the  entire  cyber  infrastructure  [43]. 
In  a  recent  outbreak,  the  Sapphire/SQL  Slammer  worm 
reached  the  peak  infection  rate  in  about  10  minutes  since 
launch,  doubling  every  8.5  seconds  [31].  Once  the  back¬ 
door  tool  gains  a  large  installed  base,  the  attacker  can 
use  the  compromised  hosts  to  launch  a  coordinated  at¬ 
tack,  such  as  a  distributed  denial-of-service  (DDoS)  at¬ 
tack  [5], 

In  this  paper,  we  develop  a  methodology  for  detecting 
malicious  patterns  in  executables.  Although  our  method 
is  general,  we  have  initially  focused  our  attention  on 
viruses.  A  computer  virus  replicates  itself  by  inserting 
a  copy  of  its  code  (the  viral  code )  into  a  host  program. 
When  a  user  executes  the  infected  program,  the  virus 
copy  runs,  infects  more  programs,  and  then  the  original 
program  continues  to  execute.  To  the  casual  user,  there 
is  no  perceived  difference  between  the  clean  and  the  in- 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

2006 

2.  REPORT  TYPE 

3.  DATES  COVERED 

00-00-2006  to  00-00-2006 

4.  TITLE  AND  SUBTITLE 

5a.  CONTRACT  NUMBER 

Static  Analysis  of  Executables  to  Detect  Malicious  Patterns 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

University  of  Wisconsin  , Computer  Sciences  Department, 716  Langdon 
Street, Madison, WI, 53706 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

14.  ABSTRACT 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

18.  NUMBER 

OF  PAGES 

18 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


fected  copies  of  a  program  until  the  vims  activates  its 
malicious  payload. 

The  classic  virus-detection  techniques  look  for  the 
presence  of  a  virus-specific  sequence  of  instructions 
(called  a  virus  signature)  inside  the  program:  if  the  sig¬ 
nature  is  found,  it  is  highly  probable  that  the  program  is 
infected.  For  example,  the  Chernobyl/CIH  virus  is  de¬ 
tected  by  checking  for  the  hexadecimal  sequence  [47]: 

E800  0000  005B  8D4B  4251  5050 

0F01  4C24  FE5B  83C3  1CFA  8B2B 

This  corresponds  to  the  following  IA-32  instruction  se¬ 
quence,  which  constitutes  part  of  the  virus  body: 

E8  00000000  call  Oh 

5B  pop  ebx 

8D  4B  42  lea  ecx,  [ebx  +  42h] 

51  push  ecx 

50  push  eax 

50  push  eax 

0F01  4C  24  FE  sidt  [esp  -  02h] 

5B  pop  ebx 

83  C3  1C  add  ebx,  ICh 

FA  cli 

8B  2B  mov  ebp,  [ebx] 

This  classic  detection  approach  is  effective  when  the 
virus  code  does  not  change  significantly  over  time.  De¬ 
tection  is  also  easier  when  viruses  originate  from  the 
same  source  code,  with  only  minor  modifications  and 
updates.  Thus,  a  virus  signature  can  be  common  to  sev¬ 
eral  virus  variants.  For  example,  Chernobyl/CIH  ver¬ 
sions  1.2,  1.3,  and  1.4  differ  mainly  in  the  trigger  date 
on  which  the  malicious  code  becomes  active  and  can  be 
effectively  detected  by  scanning  for  a  single  signature, 
namely  the  one  shown  above. 

The  virus  writers  and  the  antivims  software  develop¬ 
ers  are  engaged  in  an  obfuscation-deobfuscation  game. 
Vims  writers  try  to  obfuscate  the  “vanilla”  virus  so  that 
signatures  used  by  the  antivirus  software  cannot  detect 
these  “morphed”  vimses.  Therefore,  to  detect  an  obfus¬ 
cated  vims,  the  virus  scanners  first  must  undo  the  ob¬ 
fuscation  transformations  used  by  the  virus  writers.  In 
this  game,  virus  writers  are  obfuscators  and  researchers 
working  on  malicious  code  detection  are  deobfuscators. 
A  method  to  detect  malicious  code  should  be  resistant 
to  common  obfuscation  transformations.  This  paper  in¬ 
troduces  such  a  method.  The  main  contributions  of  this 
paper  include: 

•  The  obfuscation-deobfuscation  game  and  attacks 
on  commercial  virus  scanners 

We  view  malicious  code  detection  as  an  obfuscation- 
deobfuscation  game  between  the  virus  writers  and  the 
researchers  working  to  detect  malicious  code.  Back¬ 


ground  on  some  common  obfuscation  techniques  used 
by  virus  writers  is  given  in  Section  3.  We  also  have  de¬ 
veloped  an  obfuscator  for  executables.  Surprisingly,  the 
three  commercial  virus  scanners  we  considered  could  be 
easily  thwarted  by  simple  obfuscation  transformations 
(Section  4).  For  example,  in  some  cases  the  Norton  an¬ 
tivirus  scanner  could  not  even  detect  insertions  of  nop 
instructions. 

•  A  general  architecture  for  detecting  malicious  pat¬ 
terns  in  executables 

We  introduce  a  general  architecture  for  detecting  mali¬ 
cious  patterns  in  executables.  An  overview  of  the  archi¬ 
tecture  and  its  novel  features  is  given  in  Section  5.  Ex¬ 
ternal  predicates  and  uninterpreted  symbols  are  two  im¬ 
portant  elements  in  our  architecture.  External  predicates 
are  used  to  summarize  results  of  various  static  analyses, 
such  as  points-to  and  live-range  analysis.  We  allow  these 
external  predicates  to  be  referred  in  the  abstraction  pat¬ 
terns  that  describe  the  malicious  code.  Moreover,  we 
allow  uninterpreted  symbols  in  patterns,  which  makes 
the  method  resistant  to  renaming,  a  common  obfusca¬ 
tion  transformation.  Two  key  components  of  our  archi¬ 
tecture,  the  program  annotator  and  the  malicious  code 
detector ,  are  described  in  Sections  6  and  7  respectively. 

•  Prototype  for  x86  executables 

We  have  implemented  a  prototype  for  detecting  mali¬ 
cious  patterns  in  x86  executables.  The  tool  is  called  a 
static  analyzer  for  executables  or  SAFE.  We  have  suc¬ 
cessfully  tried  SAFE  on  multiple  viruses;  for  brevity  we 
report  on  our  experience  with  four  specific  viruses.  Ex¬ 
perimental  results  (Section  8)  demonstrate  the  efficacy 
of  SAFE.  There  are  several  interesting  directions  we  in¬ 
tend  to  pursue  as  future  work,  which  are  summarized  in 
Section  9. 

•  Extensibility  of  analysis 

SAFE  depends  heavily  on  static  analysis  techniques.  As 
a  result,  the  precision  of  the  tool  directly  depends  on  the 
static  analysis  techniques  that  are  integrated  into  it.  In 
other  words,  SAFE  is  as  good  as  the  static  analysis  tech¬ 
niques  it  is  built  upon.  For  example,  if  SAFE  uses  the 
result  of  points-to  analysis,  it  will  be  able  to  track  values 
across  memory  references.  In  the  absence  of  a  points- 
to  analyzer,  SAFE  makes  the  conservative  assumption 
that  a  memory  reference  can  access  any  memory  loca¬ 
tion  (i.e.,  everything  points  to  everything).  We  have  de¬ 
signed  SAFE  so  that  various  static  analysis  techniques 
can  be  readily  integrated  into  it.  Several  simple  static 
analysis  techniques  are  already  implemented  in  SAFE. 

2  Related  Work 

2.1  Theoretical  Discussion 

The  theoretical  limits  of  malicious  code  detection 
(specifically  of  virus  detection)  have  been  the  focus  of 
many  researchers.  Cohen  [10]  and  Chess-White  [9] 


showed  that  in  general  the  problem  of  vims  detec¬ 
tion  is  undecidable.  Similarly,  several  important  static 
analysis  problems  are  undecidable  or  computationally 
hard  [28,  35]. 

However,  the  problem  considered  in  this  paper  is 
slightly  different  than  the  one  considered  by  Cohen  [10] 
and  Chess- White  [9].  Assume  that  we  are  given  a  vanilla 
vims  V  which  contains  a  malicious  sequence  of  instruc¬ 
tions  a.  Next  we  are  given  an  obfuscated  version  0(V) 
of  the  virus.  The  problem  is  to  find  whether  there  ex¬ 
ists  a  sequence  of  instructions  a'  in  0(V)  which  is  “se¬ 
mantically  equivalent”  to  a.  A  recent  result  by  Vadhan 
et  al.  [3]  proves  that  in  general  program  obfuscation  is 
impossible.  This  leads  us  to  believe  that  a  computation¬ 
ally  bounded  adversary  will  not  be  able  to  obfuscate  a 
vims  to  completely  hide  its  malicious  behavior.  We  will 
further  explore  these  theoretical  issues  in  the  future. 

2.2  Other  Detection  Techniques 

Our  work  is  closely  related  to  previous  results  on 
static  analysis  techniques  for  verifying  security  proper¬ 
ties  of  software  [1,  4,  8,  7,  25,  29],  In  a  larger  con¬ 
text,  our  work  is  similar  to  existing  research  on  soft¬ 
ware  verification  [2,  13].  However,  there  are  several  im¬ 
portant  differences.  First,  viewing  malicious  code  de¬ 
tection  as  an  obfuscation-deobfuscation  game  is  unique. 
The  obfuscation-deobfuscation  viewpoint  lead  us  to  ex¬ 
plore  obfuscation  attacks  upon  commercial  virus  scan¬ 
ners.  Second,  to  our  knowledge,  all  existing  work  on 
static  analysis  techniques  for  verifying  security  proper¬ 
ties  analyze  source  code.  On  the  other  hand,  our  analy¬ 
sis  technique  works  on  executables.  In  certain  contexts, 
such  as  virus  detection,  source  code  is  not  available.  Fi¬ 
nally,  we  believe  that  using  uninterpreted  variables  in  the 
specification  of  the  malicious  code  is  unique  (Section 
6.2). 

Currie  et  al.  looked  at  the  problem  of  automatically 
checking  the  equivalence  of  DSP  routines  in  the  context 
of  verifying  the  correctness  of  optimizing  transforma¬ 
tions  [15],  Their  approach  is  similar  to  ours,  but  they 
impose  a  set  of  simplifying  assumptions  for  their  simula¬ 
tion  tool  to  execute  with  reasonable  performance.  Feng 
and  Hu  took  this  approach  one  step  further  by  using  a 
theorem  prover  to  determine  when  to  unroll  loops  [19]. 
In  both  cases  the  scope  of  the  problem  is  limited  to 
VLIW  or  DSP  code  and  there  is  limited  support  for  user- 
specified  analyses.  Our  work  is  applied  to  x86  (IA-32) 
assembly  and  can  take  full  advantage  of  static  analyses 
available  to  the  user  of  our  SAFE  tool.  Necula  adopts 
a  similar  approach  based  on  comparing  a  transformed 
code  sequence  against  the  original  code  sequence  in  the 
setting  of  verifying  the  correctness  of  the  GNU  C  com¬ 
piler  [38].  Using  knowledge  of  the  transformations  per¬ 
formed  by  the  compiler,  equivalence  between  the  com¬ 


piler  input  and  the  compiler  output  is  proven  using  a 
simulation  relation.  In  our  work,  we  require  no  a  pri¬ 
ori  knowledge  of  the  obfuscation  transformations  per¬ 
formed,  as  it  would  be  unrealistic  to  expect  such  infor¬ 
mation  in  the  presence  of  malicious  code. 

We  plan  to  enhance  our  framework  by  using  the  ideas 
from  existing  work  on  type  systems  for  assembly  code. 
We  are  currently  investigating  Morrisett  et  al.' s  Typed 
Assembly  Language  [32,  33],  We  apply  a  simple  type 
system  (Section  6)  to  the  binaries  we  analyze  by  manu¬ 
ally  inserting  the  type  annotations.  We  are  unaware  of 
a  compiler  that  can  produce  Typed  Assembly  Language, 
and  thus  we  plan  to  support  external  type  annotations  to 
enhance  the  power  of  our  static  analysis. 

Dynamic  monitoring  can  also  be  used  for  malicious 
code  detection.  Cohen  [10]  and  Chess- White  [9]  pro¬ 
pose  a  virus  detection  model  that  executes  code  in  a 
sandbox.  Another  approach  rewrites  the  binary  to  in¬ 
troduce  checks  driven  by  an  enforceable  security  pol¬ 
icy  [17]  (known  as  the  inline  reference  monitor  or  the 
IRM  approach).  We  believe  static  analysis  can  be  used  to 
improve  the  efficiency  of  dynamic  analysis  techniques, 
e.g.,  static  analysis  can  remove  redundant  checks  in  the 
IRM  framework.  We  construct  our  models  for  exe¬ 
cutables  similar  to  the  work  done  in  specification-based 
monitoring  [21,  46],  and  apply  our  detection  algorithm 
in  a  context-insensitive  fashion.  Other  research  used 
context-sensitive  analysis  by  employing  push-down  sys¬ 
tems  (PDSs).  Analyses  described  in  [7,  25]  use  the 
model  checking  algorithms  for  pushdown  systems  [18] 
to  verify  security  properties  of  programs.  The  data  struc¬ 
tures  used  in  interprocedural  slicing  [23],  interprocedu¬ 
ral  DFA  [40],  and  Boolean  programs  [2]  are  hierarchi¬ 
cally  structured  graphs  and  can  be  translated  to  push¬ 
down  systems. 

2.3  Other  Obfuscators 

While  deciding  on  the  initial  obfuscation  techniques 
to  focus  on,  we  were  influenced  by  several  existing  tools. 
Mistfall  (by  zOmbie)  is  a  library  for  binary  obfuscation, 
specifically  written  to  blend  malicious  code  into  a  host 
program  [49],  It  can  encrypt,  morph,  and  blend  the  virus 
code  into  the  host  program.  Our  binary  obfuscator  is 
very  similar  to  Mistfall.  Unfortunately,  we  could  not 
successfully  morph  binaries  using  Mistfall,  so  we  could 
not  perform  a  direct  comparison  between  our  obfusca¬ 
tor  and  Mistfall.  Burneye  (by  TESO )  is  a  Linux  binary 
encapsulation  tool.  Burneye  encrypts  a  binary  (possibly 
multiple  times),  and  packages  it  into  a  new  binary  with 
an  extraction  tool  [45].  In  this  paper,  we  have  not  con¬ 
sidered  encryption  based  obfuscation  techniques.  In  the 
future,  we  will  incorporate  encryption  based  obfuscation 
techniques  into  our  tool,  by  incorporating  or  extending 
existing  libraries. 


3  Background  on  Obfuscating  Viruses 

To  detect  obfuscated  viruses,  antivirus  software  have 
become  more  complex.  This  section  discusses  some 
common  obfuscation  transformations  used  by  virus  writ¬ 
ers  and  how  antivirus  software  have  historically  dealt 
with  obfuscated  viruses. 

A  polymorphic  virus  uses  multiple  techniques  to  pre¬ 
vent  signature  matching.  First,  the  virus  code  is  en¬ 
crypted,  and  only  a  small  in-clear  routine  is  designed 
to  decrypt  the  code  before  running  the  virus.  When 
the  polymorphic  virus  replicates  itself  by  infecting  an¬ 
other  program,  it  encrypts  the  virus  body  with  a  newly- 
generated  key,  and  it  changes  the  decryption  routine  by 
generating  new  code  for  it.  To  obfuscate  the  decryption 
routine,  several  transformations  are  applied  to  it.  These 
include:  nop-insertion,  code  transposition  (changing 
the  order  of  instructions  and  placing  jump  instructions  to 
maintain  the  original  semantics),  and  register  reassign¬ 
ment  (permuting  the  register  allocation).  These  trans¬ 
formations  effectively  change  the  virus  signature  (Fig¬ 
ure  1),  inhibiting  effective  signature  scanning  by  an  an¬ 
tivirus  tool. 

The  obfuscated  code  in  Figure  1  will  behave  in  the 
same  manner  as  before  since  the  nop  instruction  has 
no  effect  other  than  incrementing  the  program  counter1. 
However  the  signature  has  changed.  Analysis  can  de¬ 
tect  simple  obfuscations,  like  nop-insertion,  by  using 
regular  expressions  instead  of  fixed  signatures.  To  catch 
nop  insertions,  the  signature  should  allow  for  any  num¬ 
ber  of  nops  at  instruction  boundaries  (Figure  2).  In  fact, 
most  modern  antivirus  software  use  regular  expressions 
as  virus  signatures. 

Antivirus  software  deals  with  polymorphic  viruses 
by  performing  heuristic  analyses  of  the  code  (such  as 
checking  only  certain  program  locations  for  virus  code, 
as  most  polymorphic  viruses  attach  themselves  only  at 
the  beginning  or  end  of  the  executable  binary  [37]),  and 
even  emulating  the  program  in  a  sandbox  to  catch  the 
virus  in  action  [36],  The  emulation  technique  is  effec¬ 
tive  because  at  some  point  during  the  execution  of  the 
infected  program,  the  virus  body  appears  decrypted  in 
main  memory,  ready  for  execution;  the  detection  comes 
down  to  frequently  scanning  the  in-memory  image  of 
the  program  for  virus  signatures  while  the  program  exe¬ 
cutes. 

Metamorphic  viruses  attempt  to  evade  heuristic  de¬ 
tection  techniques  by  using  more  complex  obfuscations. 
When  they  replicate,  these  viruses  change  their  code  in 
a  variety  of  ways,  such  as  code  transposition,  substi¬ 
tution  of  equivalent  instruction  sequences,  and  register 
reassignment  [44,  51].  Furthermore,  they  can  “weave” 
the  virus  code  into  the  host  program,  making  detec¬ 
tion  by  traditional  heuristics  almost  impossible  since  the 
virus  code  is  mixed  with  program  code  and  the  virus  en¬ 


try  point  is  no  longer  at  the  beginning  of  the  program 
(these  are  designated  as  entry  point  obscuring  (EPO) 
viruses  [26]). 

As  virus  writers  employ  more  complex  obfusca¬ 
tion  techniques,  heuristic  virus-detection  techniques  are 
bound  to  fail.  Therefore,  there  is  need  to  perform  a 
deeper  analysis  of  malicious  code  based  upon  more  so¬ 
phisticated  static-analysis  techniques.  In  other  words, 
inspection  of  the  code  to  detect  malicious  patterns 
should  use  structures  that  are  closer  to  the  semantics  of 
the  code,  as  purely  syntactic  techniques,  such  as  regular 
expression  matching,  are  no  longer  adequate. 

3.1  The  Suite  of  Viruses 

We  have  analyzed  multiple  viruses  using  our  tool,  and 
discuss  four  of  them  in  this  paper.  Descriptions  of  these 
viruses  are  given  below. 

3.1.1  Detailed  Description  of  the  Viruses 
Chernobyl  (CIH) 

According  to  the  Symantec  Antivirus  Reseach  Cen¬ 
ter  (SARC),  Chernobyl/CIH  is  a  virus  that  infects  32- 
bit  Windows  95/98/NT  executable  files  [41].  When 
a  user  executes  an  infected  program  under  Windows 
95/98/ME,  the  virus  becomes  resident  in  memory.  Once 
the  virus  is  resident,  CIH  infects  other  files  when  they 
are  accessed.  Infected  files  may  have  the  same  size  as 
the  original  files  because  of  CIH’s  unique  mode  of  in¬ 
fection:  the  virus  searches  for  empty,  unused  spaces  in 
the  file2.  Next  it  breaks  itself  up  into  smaller  pieces  and 
inserts  its  code  into  these  unused  spaces.  Chernobyl  has 
two  different  payloads:  the  first  one  overwrites  the  hard 
disk  with  random  data,  starting  at  the  beginning  of  the 
disk  (sector  0)  using  an  infinite  loop.  The  second  pay- 
load  tries  to  cause  permanent  damage  to  the  computer 
by  corrupting  the  Flash  BIOS. 
zombie-6.b 

The  zOmbie-6. b  virus  includes  an  interesting  feature  - 
the  polymorphic  engine  hides  every  piece  of  the  virus, 
and  the  virus  code  is  added  to  the  infected  file  as  a  chain 
of  differently-sized  routines,  making  standard  signature 
detection  techniques  almost  useless. 
fOsfOrO 

The  fOsfOrO  virus  uses  a  polymorphic  engine  combined 
with  an  EPO  technique  to  hide  its  entry  point.  According 
to  Kaspersky  Labs  [27],  when  an  infected  file  is  run  and 
the  virus  code  gains  control,  it  searches  for  portable  ex¬ 
ecutable  files  in  the  system  directories  and  infects  them. 
While  infecting,  the  virus  encrypts  itself  with  a  polymor¬ 
phic  loop  and  writes  a  result  to  the  end  of  the  file.  To  gain 
control  when  the  infected  file  is  run,  the  virus  does  not 
modify  the  program’s  start  address,  but  instead  writes  a 
“  jmp  (virus_entry)”  instruction  into  the  middle  of 
the  file. 


Original  code 


E8  00000000 

call  Oh 

5B 

pop  ebx 

8D  4B  42 

lea  ecx, 

[ebx  +  42h] 

51 

push  ecx 

50 

push  eax 

50 

push  eax 

0F01  4C  24  FE 

sidt  [esp 

-  02h] 

5B 

pop  ebx 

83  C3  1C 

add  ebx, 

ICh 

FA 

cli 

8B  2B 

mov  ebp, 

[ebx] 

Obfuscated  code 


E8 

00000000 

call  Oh 

5B 

pop  ebx 

8D 

4B  42 

lea  ecx, 

[ebx  +  45h] 

90 

nop 

51 

push  ecx 

50 

push  eax 

50 

push  eax 

90 

nop 

0F01  4C  24  FE 

sidt  [esp 

-  02h] 

5B 

pop  ebx 

83 

C3  1C 

add  ebx, 

ICh 

90 

nop 

FA 

cli 

8B 

2B 

mov  ebp, 

[ebx] 

Signature  New  signature 

E800  0000  005B  8D4B  4251  5050  E800  0000  005B  8D4B  4290  5150 

0F01  4C24  FE5B  83C3  1CFA  8B2B  50S1  0F01  4C24  FE5B  83C3  !  :90 

FA8B  2B 


Figure  1:  Original  code  and  obfuscated  code  from  Chernobyl/CIH,  and  their  corresponding  signatures.  Newly  added 
instructions  are  highlighted. 


E800 

0000 

00 (90) * 

5B (90) * 

8D4B 

42 (90) * 

51  (90) * 

50(90) * 

50 (90) * 

0F01 

4C24 

FE (90) * 

5B  (90) * 
8B2B 

83C3 

1C  (90) * 

FA ( 90 ) * 

Figure  2:  Extended  signature  to  catch  nop-insertion. 


Hare 

Finally,  the  Hare  vims  infects  the  bootloader  sectors 
of  floppy  disks  and  hard  drives,  as  well  as  executable 
programs.  When  the  payload  is  triggered,  the  vims 
overwrites  random  sectors  on  the  hard  disk,  making  the 
data  inaccessible.  The  virus  spreads  by  polymorphically 
changing  its  decryption  routine  and  encrypting  its  main 
body. 

The  Hare  and  Chernobyl/CIH  vimses  are  well  known 
in  the  antivims  community,  with  their  presence  in  the 
wild  peaking  in  1996  and  1998,  respectively.  In  spite 
of  this,  we  discovered  that  current  commercial  virus 
scanners  could  not  detect  slightly  obfuscated  versions 
of  these  viruses. 

4  Obfuscation  Attacks  on  Commercial 
Virus  Scanners 

We  tested  three  commercial  virus  scanners  against 
several  common  obfuscation  transformations.  To  test  the 
resilience  of  commercial  vims  scanners  to  common  ob¬ 
fuscation  transformations,  we  have  developed  an  obfus- 
cator  for  binaries.  Our  obfuscator  supports  four  com¬ 
mon  obfuscation  transformations:  dead-code  insertion, 
code  transposition,  register  reassignment,  and  instruc¬ 
tion  substitution.  While  there  are  other  generic  obfus¬ 


cation  techniques  [11,  12],  those  described  here  seem 
to  be  preferred  by  malicious  code  writers,  possibly  be¬ 
cause  implementing  them  is  easy  and  they  add  little  to 
the  memory  footprint. 

4.1  Common  Obfuscation  Transformations 

4.1.1  Dead-Code  Insertion 

Also  known  as  trash  insertion ,  dead-code  insertion 
adds  code  to  a  program  without  modifying  its  behav¬ 
ior.  Inserting  a  sequence  of  nop  instructions  is  the  sim¬ 
plest  example.  More  interesting  obfuscations  involve 
constructing  challenging  code  sequences  that  modify  the 
program  state,  only  to  restore  it  immediately. 

Some  code  sequences  are  designed  to  fool  antivirus 
software  that  solely  rely  on  signature  matching  as  their 
detection  mechanism.  Other  code  sequences  are  com¬ 
plicated  enough  to  make  automatic  analysis  very  time- 
consuming,  if  not  impossible.  For  example,  passing  val¬ 
ues  through  memory  rather  than  registers  or  the  stack 
requires  accurate  pointer  analysis  to  recover  values.  The 
example  shown  in  Figure  3  should  clarify  this.  The  code 
marked  by  (*)  can  be  easily  eliminated  by  automated 
analysis.  On  the  other  hand,  the  second  and  third  inser¬ 
tions,  marked  by  (**),  do  cancel  out  but  the  analysis  is 
more  complex.  Our  obfuscator  supports  dead-code  in¬ 
sertion. 

Not  all  dead-code  sequence  can  be  detected  and  elim¬ 
inated,  as  this  problem  reduces  to  program  equivalence 
(i.e.,  is  this  code  sequence  equivalent  to  an  empty  pro¬ 
gram?).  which  is  undecidable.  We  believe  that  many 
common  dead-code  sequences  can  be  detected  and  elim¬ 
inated  with  acceptable  performance.  To  quote  the  docu- 


mentation  of  the  RPME  vims  permutation  engine  [50], 

[T]rash  [does  not  make  the]  program  more 
complex  [...].  If  [the]  detecting  algorithm  will 
be  written  such  as  I  think,  then  there  is  no 
difference  between  NOP  and  more  complex 
trash. 

Our  detection  tool,  SAFE,  identifies  several  kinds  of 
such  dead-code  segments. 

4.1.2  Code  Transposition 

Code  transposition  shuffles  the  instructions  so  that  the 
order  in  the  binary  image  is  different  from  the  execu¬ 
tion  order,  or  from  the  order  of  instructions  assumed  in 
the  signature  used  by  the  antivirus  software.  To  achieve 
the  first  variation,  we  randomly  reorder  the  instructions 
and  insert  unconditional  branches  or  jumps  to  restore  the 
original  control-flow.  The  second  variation  swaps  in¬ 
structions  if  they  are  not  interdependent,  similar  to  com¬ 
piler  code  generation,  but  with  the  different  goal  of  ran¬ 
domizing  the  instruction  stream. 

The  two  versions  of  this  obfuscation  technique  differ 
in  their  complexity.  The  code  transposition  technique 
based  upon  unconditional  branches  is  relatively  easy  to 
implement.  The  second  technique  that  interchanges  in¬ 
dependent  instructions  is  more  complicated  because  the 
independence  of  instructions  must  be  ascertained.  On 
the  analysis  side,  code  transposition  can  complicate  mat¬ 
ters  only  for  a  human.  Most  automatic  analysis  tools  (in¬ 
cluding  ours)  use  an  intermediate  representation,  such 
as  the  control  flow  graph  (CFG)  or  the  program  depen¬ 
dence  graph  (PDG)  [23],  that  is  not  sensitive  to  super¬ 
fluous  changes  in  control  flow.  Note  that  an  optimizer 
acts  as  a  deobfuscator  in  this  case  by  finding  the  unnec¬ 
essary  unconditional  branches  and  removing  them  from 
the  program  code.  Currently,  our  obfuscator  supports 
only  code  transposition  based  upon  inserting  uncondi¬ 
tional  branches. 

4.1.3  Register  Reassignment 

The  register  reassignment  transformation  replaces  us¬ 
age  of  one  register  with  another  in  a  specific  live  range. 
This  technique  exchanges  register  names  and  has  no 
other  effect  on  program  behavior.  For  example,  if  reg¬ 
ister  ebx  is  dead  throughout  a  given  live  range  of  the 
register  eax,  it  can  replace  eax  in  that  live  range.  In 
certain  cases,  register  reassignment  requires  insertion  of 
prologue  and  epilogue  code  around  the  live  range  to  re¬ 
store  the  state  of  various  registers.  Our  binary  obfuscator 
supports  this  code  transformation. 

The  purpose  of  this  transformation  is  to  subvert  the 
antivirus  software  analyses  that  rely  upon  signature¬ 
matching.  There  is  no  real  obfuscatory  value  gained  in 
this  process.  Conceptually,  the  deobfuscation  challenge 


is  equally  complex  before  or  after  the  register  reassign¬ 
ment. 

4.1.4  Instruction  Substitution 

This  obfuscation  technique  uses  a  dictionary  of  equiv¬ 
alent  instruction  sequences  to  replace  one  instruction 
sequence  with  another.  Since  this  transformation  re¬ 
lies  upon  human  knowledge  of  equivalent  instructions,  it 
poses  the  toughest  challenge  for  automatic  detection  of 
malicious  code.  The  IA-32  instruction  set  is  especially 
rich,  and  provides  several  ways  of  performing  the  same 
operation.  Coupled  with  several  architecturally  ambiva¬ 
lent  features  (e.g.,  a  memory-based  stack  that  can  be  ac¬ 
cessed  both  as  a  stack  using  dedicated  instructions  and 
as  a  memory  area  using  standard  memory  operations), 
the  IA-32  assembly  language  provides  ample  opportu¬ 
nity  for  instruction  substitution. 

To  handle  obfuscation  based  upon  instruction  substi¬ 
tution,  an  analysis  tool  must  maintain  a  dictionary  of 
equivalent  instruction  sequences,  similar  to  the  dictio¬ 
nary  used  to  generate  them.  This  is  not  a  comprehen¬ 
sive  solution,  but  it  can  cope  with  the  common  cases.  In 
the  case  of  IA-32,  the  problem  can  be  slightly  simplified 
by  using  a  simple  intermediate  language  that  “unwinds” 
the  complex  operations  corresponding  to  each  IA-32  in¬ 
struction.  In  some  cases,  a  theorem  prover  such  as  Sim¬ 
plify  [16]  or  PVS  [39]  can  also  be  used  to  prove  that  two 
sequences  of  instructions  are  equivalent. 

4.2  Testing  Commercial  Antivirus  Tools 

We  tested  three  commercial  virus  scanners  using  ob¬ 
fuscated  versions  of  the  four  viruses  described  earlier. 
The  results  were  quite  surprising:  a  combination  of 
nop-insertion  and  code  transposition  was  enough  to 
create  obfuscated  versions  of  the  viruses  that  the  com¬ 
mercial  virus  scanners  could  not  detect.  Moreover,  the 
Norton  antivirus  software  could  not  detect  an  obfus¬ 
cated  version  of  the  Chernobyl  virus  using  just  nop- 
insertions.  SAFE  was  resistant  to  the  two  obfuscation 
transformations.  The  results  are  summarized  in  Table  1 . 
A  •/  indicates  that  the  antivirus  software  detected  the 
virus.  A  X  means  that  the  software  did  not  detect  the 
virus.  Note  that  unobfuscated  versions  of  all  four  viruses 
were  detected  by  all  the  tools. 

5  Architecture 

This  section  gives  an  overview  of  the  architecture  of 
SAFE  (Figure  4).  Subsequent  sections  provide  detailed 
descriptions  of  the  major  components  of  SAFE. 

To  detect  malicious  patterns  in  executables,  we  build 
an  abstract  representation  of  the  malicious  code  (here 
a  virus).  The  abstract  representation  is  the  “general¬ 
ization”  of  the  malicious  code,  e.g.,  it  incorporates  ob¬ 
fuscation  transformations,  such  as  superfluous  changes 


Original  code 

Code  obfuscated  through 
dead-code  insertion 

Code  obfuscated  through 
code  transposition 

Code  obfuscated  through 
instruction  substitution 

call  Oh 

call  Oh 

call  Oh 

call  Oh 

pop  ebx 

pop  ebx 

pop  ebx 

pop  ebx 

lea  ecx,  [ebx+42h] 

lea  ecx. 

[ebx+42h] 

jmp  S2 

lea  ecx. 

[ebx+42h] 

push  ecx 

nop 

(*> 

S3: 

push  eax 

sub  esp. 

03h 

push  eax 

nop 

(*) 

push  eax 

sidt  [esp 

-  02h] 

push  eax 

push  ecx 

sidt  [esp  -  02h] 

add  [esp] 

,  ICh 

sidt  [esp  -  02h] 

push  eax 

jmp  S4 

mov  ebx. 

[esp] 

pop  ebx 

inc  eax 

(**> 

add  ebx,  ICh 

inc  esp 

add  ebx,  ICh 

push  eax 

jmp  S6 

cli 

cli 

dec  [esp 

-  Oh]  (**) 

S2  : 

lea  ecx,  [ebx+42h  ] 

mov  ebp. 

[ebx] 

mov  ebp,  [ebx] 

dec  eax 

(**) 

push  ecx 

sidt  [esp  -  02h] 

jmp 

S3 

pop  ebx 

S4  : 

pop 

ebx 

add  ebx,  ICh 

cli 

cli 

jmp 

S5 

mov  ebp,  [ebx] 

S5 : 

mov 

ebp,  [ebx] 

Figure  3:  Examples  of  obfuscation  through  dead-code  insertion,  code  transposition,  and  instruction  substitution.  Newly  added 
instructions  are  highlighted. 


Norton® 

Antivirus 

7.0 

McAfee® 

VirusScan 

6.01 

Command® 

Antivirus 

4.61.2 

SAFE 

Chernobyl 

original 

/ 

/ 

y 

/ 

obfuscated 

X111 

X11’2' 

X[1’21 

/ 

z0mbie-6.b 

original 

/ 

✓ 

/ 

/ 

obfuscated 

X11’21 

X11’21 

/ 

fOsfOrO 

original 

y 

y 

y 

/ 

obfuscated 

XM 

Xll’2] 

X11’21 

/ 

Hare 

original 

y 

/ 

/ 

/ 

obfuscated 

xi^r- 

XFTS]— 

XTTJ]— 

/ 

Obfuscations  considered:  41  =  nop-insertion  (a  form  of  dead-code  insertion) 
^  =  code  transposition 

Table  1:  Results  of  testing  various  virus  scanners  on  obfuscated  viruses. 


in  control  flow  and  register  reassignments.  Similarly, 
one  must  construct  an  abstract  representation  of  the  ex¬ 
ecutable  in  which  we  are  trying  to  find  a  malicious  pat¬ 
tern.  Once  the  generalization  of  the  malicious  code  and 
the  abstract  representation  of  the  executable  are  created, 
we  can  then  detect  the  malicious  code  in  the  executable. 
We  now  describe  each  component  of  SAFE. 
Generalizing  the  malicious  code:  Building  the  mali¬ 
cious  code  automaton 

The  malicious  code  is  generalized  into  an  automaton 
with  uninterpreted  symbols.  Uninterpreted  symbols 
(Section  6.2)  provide  a  generic  way  of  representing  data 
dependencies  between  variables  without  specifically  re¬ 
ferring  to  the  storage  location  of  each  variable. 
Pattern-definition  loader 

This  component  takes  a  library  of  abstraction  patterns 
and  creates  an  internal  representation.  These  abstraction 
patterns  are  used  as  alphabet  symbols  by  the  malicious 
code  automaton. 

The  executable  loader 

This  component  transforms  the  executable  into  an  in¬ 
ternal  representation,  here  the  collection  of  control 
flow  graphs  (CFGs),  one  for  each  program  procedure. 


The  executable  loader  (Figure  5)  uses  two  off-the-shelf 
components,  IDA  Pro  and  CodeSurfer.  IDA  Pro  (by 
DataRescue  [42])  is  a  commercial  interactive  disassem¬ 
bler.  CodeSurfer  (by  GrammaTech,  Inc.  [24])  is  a 
program-understanding  tool  that  performs  a  variety  of 
static  analyses.  CodeSurfer  provides  an  API  for  ac¬ 
cess  to  various  structures,  such  as  the  CFGs  and  the  call 
graph,  and  to  results  of  a  variety  of  static  analyses,  such 
as  points-to  analysis.  In  collaboration  with  GrammaT¬ 
ech,  we  have  developed  a  connector  that  transforms  IDA 
Pro  internal  structures  into  an  intermediate  form  that 
CodeSurfer  can  parse. 

The  annotator 

This  component  inputs  a  CFG  from  the  executable  and 
the  set  of  abstraction  patterns  and  produces  an  anno¬ 
tated  CFG,  the  abstract  representation  of  a  program  pro¬ 
cedure.  The  annotated  CFG  includes  information  that 
indicates  where  a  specific  abstraction  pattern  was  found 
in  the  executable.  The  annotator  runs  for  each  proce¬ 
dure  in  the  program,  transforming  each  CFG.  Section  6 
describes  the  annotator  in  detail. 


Figure  4:  Architecture  of  the  static  analyzer  for  executables  (SAFE). 


Figure  5:  Implementation  of  executable  loader 
module. 


The  detector 

This  component  computes  whether  the  malicious  code 
(represented  by  the  malicious  code  automaton)  appears 
in  the  abstract  representation  of  the  executable  (created 
by  the  annotator).  This  component  uses  an  algorithm 
based  upon  language  containment  and  unification.  De¬ 
tails  can  be  found  in  Section  7. 

Throughout  the  rest  of  the  paper,  the  malicious  code 
fragment  shown  in  Figure  6  is  used  as  a  running  exam¬ 
ple.  This  code  fragment  was  extracted  from  the  Cher¬ 
nobyl  virus  version  1 .4. 

To  obtain  the  obfuscated  code  fragment  depicted  (Fig¬ 
ure  7),  we  applied  the  following  obfuscation  transforma¬ 
tions:  dead-code  insertion,  code  transposition,  and  reg¬ 
ister  reassignment.  Incidentally,  the  three  commercial 
antivirus  software  (Norton,  McAfee,  and  Command)  de¬ 
tected  the  original  code  fragment  shown.  However,  the 
obfuscated  version  was  not  detected  by  any  of  the  three 
commercial  antivirus  software. 

6  Program  Annotator 

This  section  describes  the  program  annotator  in  detail 
and  the  data  structures  and  static  analysis  concepts  used 
in  the  detection  algorithm.  The  program  annotator  in¬ 
puts  the  CFG  of  the  executable  and  a  set  of  abstraction 
patterns  and  outputs  an  annotated  CFG.  The  annotated 
CFG  associates  with  each  node  n  in  the  CFG  a  set  of 
patterns  that  match  the  program  at  the  point  correspond¬ 
ing  to  the  node  n.  The  precise  syntax  for  an  abstraction 


Original  code 


WVCTF : 

mov 

eax, 

dr  1 

mov 

ebx, 

[eax+lOh] 

mov 

edi , 

[eax] 

LOWVCTF : 

pop 

jecxz 

mov 

ecx 

SFMM 

esi. 

ecx 

mov 

eax. 

0d601h 

pop 

pop 

call 

jmp 

edx 

ecx 

edi 

LOWVCTF 

SFMM : 

pop 

pop 

stc 

pushf 

ebx 

eax 

Figure  6:  Original  code  fragment  from  Chernobyl  virus 
version  1.4. 


pattern  and  the  semantics  of  matching  are  provided  later 
in  the  section. 

Figure  8  shows  the  CFG  and  a  simple  annotated  CFG 
corresponding  to  the  obfuscated  code  from  Figure  7. 
Note  that  one  node  in  the  annotated  CFG  can  corre¬ 
spond  to  several  nodes  in  the  original  CFG.  For  example, 
the  nodes  annotated  with  “Irrelevantlnstr”  corresponds 
to  one  or  more  nop  instructions. 

The  annotations  that  appear  in  Figure  8  seem  intuitive, 
but  formulating  them  within  a  static-analysis  frame¬ 
work  requires  formal  definitions.  We  enhance  the  SAFE 
framework  with  a  type  system  for  x86  based  on  the  type- 
state  system  described  in  [48].  However,  other  type  sys¬ 
tems  designed  for  assembly  languages,  such  as  Typed 
Assembly  Language  [32,  33],  could  be  used  in  the  SAFE 
framework.  Definitions,  patterns,  and  the  matching  pro¬ 
cedure  are  described  in  Sections  6.1,  6.2  and  6.3  respec¬ 
tively. 


Obfuscated  code 


WVCTF : 

mov 

eax, 

dr  1 

jmp 

Loci 

Loc2  : 

mov 

edi , 

[eax] 

LOWVCTF : 

pop 

ecx 

jecxz 

SFMM 

nop 

mov 

esi, 

ecx 

nop 

nop 

mov 

eax, 

0d601h 

jmp 

Loc3 

Loci : 

mov 

ebx, 

[eax+lOh] 

jmp 

Loc2 

Loc3 : 

pop 

edx 

pop 

ecx 

nop 

call 

edi 

jmp 

LOWVCTF 

SFMM : 

pop 

ebx 

pop 

eax 

push 

eax 

pop 

eax 

stc 

pushf 

Figure  7:  Obfuscated  version  based  upon  code  in  Fig¬ 
ure  6. 


6.1  Basic  Definitions 

This  section  provides  the  formal  definitions  used  in 
the  rest  of  the  paper. 

Program  Points 

An  instruction  I  is  a  function  application,  I  :  n  x 
•  •  •  x  Tfc  — >  r.  While  the  type  system  does  not  preclude 
higher-order  functions  or  function  composition,  it  is  im¬ 
portant  to  note  that  most  assembly  languages  (including 
x86)  do  not  support  these  concepts.  A  program  P  is  a 
sequence  of  instructions  {I\, . . . ,  In)-  During  program 
execution,  the  instructions  are  processed  in  the  sequen¬ 
tial  order  they  appear  in  the  program,  with  the  exception 
of  control-flow  instructions  that  can  change  the  sequen¬ 
tial  execution  order.  The  index  of  the  instruction  in  the 
program  sequence  is  called  a  program  point  (or  program 
counter ),  denoted  by  the  function  pc  :  {I\, . . .  ,In}  — ► 

and  defined  as  pc  (I j)  d=  j,  V  1  <  j  < 

N.  The  set  of  all  program  points  for  program  P  is 

de  f 

ProgramPoints(P)  =  {1, ...,  N}.  The  pc  function 
provides  a  total  ordering  over  the  set  of  program  instruc¬ 
tions. 


Control  Flow  Graph 

A  basic  block  B  is  a  maximal  sequence  of  instructions 
Im)  that  contains  at  most  one  control-flow  in¬ 
struction,  which  must  appear  at  the  end.  Thus,  the  ex¬ 
ecution  within  a  basic  block  is  by  definition  sequential. 
Let  V  be  the  set  of  basic  blocks  for  a  program  P,  and 
E  C  V  x  V  x  {T,  F}  be  the  set  of  control  flow  tran¬ 
sitions  between  basic  blocks.  Each  edge  is  marked  with 
either  T  or  F  corresponding  to  the  condition  ( true  or 
false )  on  which  that  edge  is  followed.  Unconditional 
jumps  have  outgoing  edges  always  marked  with  T.  The 
directed  graph  CFG(P)  =  (V.  E)  is  called  the  control 
flow  graph. 

Predicates 

Predicates  are  the  mechanism  by  which  we  incorporate 
results  of  various  static  analyses  such  as  live  range  and 
points-to  analysis.  These  predicates  can  be  used  in  the 
definition  of  abstraction  patterns.  Table  2  lists  predicates 
that  are  currently  available  in  our  system.  For  example, 
code  between  two  program  points  pi  and  P2  can  be  veri¬ 
fied  as  dead-code  (Section  4. 1.1)  by  checking  that  for  ev¬ 
ery  variable  m  that  is  live  in  the  program  range  [pi,p-2], 
its  value  at  point  p2  is  the  same  as  its  value  at  point  p-\ . 
The  change  in  to’ s  value  between  two  program  points 
Pi  and  p2  is  denoted  by  Delta(m,pi,p2)  and  can  be 
implemented  using  polyhedral  analysis  [14], 

Explanations  of  the  static  analysis  predicates  shown 
in  Table  2  are  standard  and  can  be  found  in  a  compiler 
textbook  (such  as  [34]). 

Instructions  and  Data  Types 

The  type  constructors  build  upon  simple  integer  types 
(listed  below  as  the  ground  class  of  types),  and  allow 
for  array  types  (with  two  variations:  the  pointer-to-start- 
of-array  type  and  the  pointer-to-middle-of-array  type), 
structures  and  unions,  pointers,  and  functions.  Two  spe¬ 
cial  types  _L(n)  and  T (n)  complete  the  type  system  lat¬ 
tice.  _L(n)  and  T (n)  represent  types  that  are  stored  on  n 
bits,  with  _L(n)  being  the  least  specific  (“any”)  type  and 
T (n)  being  the  most  specific  type.  Table  3  describes  the 
constructors  allowed  in  our  type  system. 

The  type  p,(l,r,i)  represents  the  type  of  a  field  mem¬ 
ber  of  a  structure.  The  field  has  a  type  r  (independent  of 
the  types  of  all  other  fields  in  the  same  structure),  an  off¬ 
set  i  that  uniquely  determines  the  location  of  the  field 
within  the  structure,  and  a  label  l  that  identifies  the  field 
within  the  structure  (in  some  cases  this  label  might  be 
undefined). 

Physical  subtyping  takes  into  account  the  layout  of 
values  in  memory  [6,  48].  If  a  type  r  is  a  physical  sub- 
type  of  t'  (denoted  it  by  r  <  r'),  then  the  memory  lay¬ 
out  of  a  value  of  type  t'  is  a  prefix  of  the  memory  layout 
of  a  value  of  type  r.  We  will  not  describe  the  rules  of 
physical  subtyping  here  as  we  refer  the  reader  to  Xu’s 
thesis  [48]  for  a  detailed  account  of  the  typestate  system 


Figure  8:  Control  flow  graph  of  obfuscated  code  fragment,  and  annotations. 


Dominators(B) 

Post.Dominators(B) 

Pred(B) 

Succ(B) 

First(B) 

Last(B) 

Previous(I) 

Next(I) 

Kills(p,  a) 

Uses(p,  a) 

Alias(p,  x,  y) 
LiveRangeStart(p,  a) 
LiveRangeEnd(p ,  a) 
Delta(p,  m,  n) 
Delta(m,pi,p2) 
PointsTo(p,  x,  a) 


the  set  of  basic  blocks  that  dominate  the  basic  block  B 

the  set  of  basic  blocks  that  are  dominated  by  the  basic  block  B 

the  set  of  basic  blocks  that  immediately  precede  B 

the  set  of  basic  blocks  that  immediately  follow  B 

the  first  instruction  of  the  basic  block  B 

the  last  instruction  of  the  basic  block  B 

f  U B'ePrediBj)  Last(B  )  if  I  =  First(Bi) 

l  /'  if  Bt  =  (...,  I,  ...) 

f  Us'e^ccts,)  First(B’)  if  7  =  Last(Bz) 

\  I'  if  73/  =  (...,  7,  I',  ...} 

true  if  the  instruction  at  program  point  p  kills  variable  a 

true  if  the  instruction  at  program  point  p  uses  variable  a 

true  if  variable  x  is  an  alias  for  y  at  program  point  p 

the  set  of  program  points  that  start  the  a’s  live  range  that  includes  p 

the  set  of  program  points  that  end  the  a’s  live  range  that  includes  p 

the  difference  between  integer  variables  m  and  n  at  program  point  p 

the  change  in  m’s  value  between  program  points  pi  and  p2 

true  if  variable  x  points  to  location  of  a  at  program  point  p 


Table  2:  Examples  of  static  analysis  predicates. 


T 

::  ground 

Ground  types 

I  r  [n] 

Pointer  to  the  base  of  an  array  of  type  r  and  of  size  n 

r  (n] 

Pointer  into  the  middle  of  an  array  of  type  r  and  of  size  n 

t  ptr 

Pointer  to  r 

s{/ri,.. 

. ,  /ik}  Structure  (product  of  types  of  p,\) 

u{yu.. 

. ,  /.ik}  Union 

Tl  x  •  •  • 

x  Tk  — *  t  Function 

T  (n) 

Top  type  of  n  bits 

-L  (n) 

Bottom  type  of  n  bits  (type  “any”  of  n  bits) 

P 

::  (l,r,i) 

Member  labeled  l  of  type  t  at  offset  i 

ground 

::  int  (g : 

s:v)  |  uint  ( g:s:v )  |  . . . 

Table  3:  A  simple  type  system. 


(including  sub  typing  rules). 

The  type  int  ( g:s:v )  represents  a  signed  integer, 
and  it  covers  a  wide  variety  of  values  within  storage  lo¬ 
cations.  It  is  parametrized  using  three  parameters  as  fol¬ 
lows:  g  represents  the  number  of  highest  bits  that  are 
ignored,  s  is  the  number  of  middle  bits  that  represent  the 
sign,  and  v  is  the  number  of  lowest  bits  that  represent 
the  value.  Thus  the  type  int  ( g:s:v )  uses  a  total  of 
g  +  s  +  v  bits. 

dg+s+v  ■  ■  •  ^s+v+1  dsy.v  •  •  •  dv+ 1  dv  ...  d± 

ignored  sign  value 

The  type  uint  ( g:s:v )  represents  an  unsigned  integer, 
and  it  is  just  a  variation  of  int  (g:s:v) ,  with  the  mid¬ 
dle  s  sign  bits  always  set  to  zero. 

The  notation  int  ( g:s:v )  allows  for  the  separation 
of  the  data  and  storage  location  type.  In  most  assem¬ 
bly  languages,  it  is  possible  to  use  a  storage  location 
larger  than  that  required  by  the  data  type  stored  in  it.  For 
example,  if  a  byte  is  stored  right-aligned  in  a  (32-bit) 
word,  its  associated  type  is  int  (24:1:7).  This  means 


that  an  instruction  such  as  xor  on  least  significant  byte 
within  32-bit  word  will  preserve  the  leftmost  24  bits  of 
the  32-bit  word,  even  though  the  instruction  addresses 
the  memory  on  32-bit  word  boundary. 

This  separation  between  data  and  storage  location 
raises  the  issue  of  alignment  information,  i.e.,  most  com¬ 
puter  systems  require  or  prefer  data  to  be  at  a  memory 
address  aligned  to  the  data  size.  For  example,  32-bit 
integers  should  be  aligned  on  4-byte  boundaries,  with 
the  drawback  that  accessing  an  unaligned  32-bit  integer 
leads  to  either  a  slowdown  (due  to  several  aligned  mem¬ 
ory  accesses)  or  an  exception  that  requires  handling  in 
software.  Presently,  we  do  not  use  alignment  informa¬ 
tion  as  it  does  not  seem  to  provide  a  significant  covert 
way  of  changing  the  program  flow. 

Figure  9  shows  the  types  for  operands  in  a  section  of 
code  from  the  Chernobyl/CIH  virus.  Table  4  illustrates 
the  type  system  for  Intel  IA-32  architecture.  There  are 
other  IA-32  data  types  that  are  not  covered  in  Table  4,  in¬ 
cluding  bit  strings,  byte  strings,  64-  and  128-bit  packed 
SIMD  types,  and  BCD  and  packed  BCD  formats.  The 


Code 

Type 

call  Oh 

pop  ebx 

ebx  :  _L  (32) 

lea  ecx, 

[ebx  +  42h] 

ecx  :  _L  (32) , 
ebx  :  ptr  _L  (32) 

push  ecx 

ecx  :  _L  (32) 

push  eax 

eax  :  J_  (32) 

push  eax 

eax  :  _L  (32) 

sidt  [esp 

-  02h] 

pop  ebx 

eax  :  _L  (32) 

add  ebx, 
cli 

ICh 

ebx  :  int  (0:1:31) 

mov  ebp, 

[ebx] 

ebp  :  _L  (32) , 
ebx  :  ptr  _L  (32) 

Figure  9:  Inferred  types  from  Chernobyl/CIH  virus  code. 

IA-32  logical  address  is  a  combination  of  a  16-bit  seg¬ 
ment  selector  and  a  32-bit  segment  offset,  thus  its  type 
is  the  cross  product  of  a  16-bit  unsigned  integer  and  a 
32-bit  pointer. 

6.2  Abstraction  Patterns 

An  abstraction  pattern  T  is  a  3-tuple  (V.  O .  C),  where 
V  is  a  list  of  typed  variables,  O  is  a  sequence  of  instruc¬ 
tions,  and  C  is  a  boolean  expression  combining  one  or 
more  static  analysis  predicates  over  program  points.  For¬ 
mally,  a  pattern  T  =  (V,0,C)  is  a  3-tuple  defined  as 
follows: 

V  =  {  X!  :  Ti , .  . . ,  Xk  :  Tfc  } 

O  =  (  I(vi,  .  .  .  ,Vm)  I  I  :  Tl  X  •  •  •  X  Tm  ~ >  T  ) 

C  =  boolean  expression  involving  static 

analysis  predicates  and  logical  operators 

An  instruction  from  the  sequence  O  has  a  number 
of  arguments  (wj)j>o,  where  each  argument  is  either  a 
literal  value  or  a  free  variable  xr  We  write  l'(x\  : 
Ti, . . . ,  Xk  :  Tfc)  to  denote  the  pattern  T  =  (' V, ,  O,  C) 
with  free  variables  X\,. . . ,  Xk-  An  example  of  a  pattern 
is  shown  below. 

T(  X  :  inf  (0  :  1  :  31)  )  = 

(  {  X  :  mt(0  :  1  :  31)  }, 

(  pi  :  “pop  X  ”, 

P2  :  uaddX,  03AFh”  ), 
pi  G  LiveRangeStart(jp2,  X)  ) 

This  pattern  represents  two  instructions  that  pop  a  reg¬ 
ister  X  off  the  stack  and  then  add  a  constant  value  to 
it  (0x03AF).  Note  the  use  of  uninterpreted  symbol  X 
in  the  pattern.  Use  of  the  uninterpreted  symbols  in  a 
pattern  allows  it  to  match  multiple  sequences  of  instruc¬ 
tions,  e.g.,  the  patterns  shown  above  matches  any  instan¬ 
tiation  of  the  pattern  where  X  is  assigned  a  specific  reg¬ 
ister.  The  type  int( 0  :  1  :  31)  of  X  represents  an  integer 
with  31  bits  of  storage  and  one  sign  bit. 


We  define  a  binding  B  as  a  set  of  pairs 
[variable  v,  value  x\.  Formally,  a  binding  B  is  de¬ 
fined  as  {  [x,  v]  |  x  €  V,  x  :  t,  v  :  r\  t  <t’  }.  If  a  pair 
[x,  v]  occurs  in  a  binding  B.  then  we  write  B(x)  =  v. 
Two  bindings  B\  and  £>2  are  said  to  be  compatible  if 
they  do  not  bind  the  same  variable  to  different  values: 

Compatible(Bi ,  B2)  d— 

V  x  6  V.(  [x,  yi]  G  Bi  A  [*,  2/2]  €  B2  ) 

->  (3/1  =  3/2) 

The  union  of  two  compatible  bindings  B\  and  £>2  in¬ 
cludes  all  the  pairs  from  both  bindings.  For  incompatible 
bindings,  the  union  operation  returns  an  empty  binding. 

{  [x,vx\  :  [x,vx]  G  Si  V  [x,vx]  G  82  } 

„  „  def  if  Compatible(Bi,B2) 

01U02  =  ( 

0  if  -1  Compatible  (Si,  B2) 

When  matching  an  abstraction  pattern  against  a  se¬ 
quence  of  instructions,  we  use  unification  to  bind  the 
free  variables  of  I  to  actual  values.  The  function 

Unify  (  (. . .  ,opi(xiti, . . .  ,xi>ni), . .  T) 

returns  a  “most  general”  binding  B  if  the  instruction  se¬ 
quence  (. . . ,  opi  ( -i ' .  1 , . . . ,  Xx^ni ) , . . . )  1  <i<m  can  be  uni¬ 
fied  with  the  sequence  of  instructions  O  specified  in 
the  pattern  F.  If  the  two  instruction  sequences  can¬ 
not  be  unified.  Unify  returns  false.  Definitions  and  al¬ 
gorithms  related  to  unification  are  standard  and  can  be 
found  in  [20], 3 

6.3  Annotator  Operation 

The  annotator  associates  a  set  of  matching  patterns 
with  each  node  in  the  CFG.  The  annotated  CFG  of  a 
program  procedure  P  with  respect  to  a  set  of  patterns 
S  is  denoted  by  If.  Assume  that  a  node  n  in  the 
CFG  corresponds  to  the  program  point  p  and  the  in¬ 
struction  at  p  is  Ip.  The  annotator  attempts  to  match  the 
(possibly  interprocedural)  instruction  sequence  S(n)  = 
(. . . ,  Previous2 {Ip),  Previous{Ip),  Ip)  with  the  pat¬ 
terns  in  the  set  S  =  {F 1 , . . .  ,rm}.  The  CFG  node  n 
is  then  labeled  with  the  list  of  pairs  of  patterns  and  bind¬ 
ings  that  satisfy  the  following  condition: 

Annotationfn )  =  {  [F,  B\  :  T  G  {IT, . . . ,  Fm}  A 

B=  Unify  (S(n),r)} 

If  Unify(S(n),T)  returns  false  (because  unification 
is  not  possible),  then  the  node  n  is  not  annotated  with 
[r,£>].  Note  that  a  pattern  T  might  appear  several 
times  (albeit  with  different  bindings)  in  Annotation(n) . 
However,  the  pair  [r,£>]  is  unique  in  the  annotation  set 
of  a  given  node. 


IA-32  Datatype  Type  Expression 


byte  unsigned  int 

uint  (0:0:8) 

word  unsigned  int 

uint  (0:0:16) 

doubleword  unsigned  int 

uint  (0:0:32) 

quadword  unsigned  int 

uint  (0:0:64) 

double  quadword  unsigned  int 

uint  (0:0:128) 

byte  signed  int 

int  (0:1:7) 

word  signed  int 

int  (0:1:15) 

doubleword  signed  int 

int  (0:1:31 ) 

quadword  signed  int 

int  (0:1:63) 

double  quadword  signed  int 

int  (0:1:127) 

single  precision  float 

float  (0:1:31) 

double  precision  float 

float  (0:1:63) 

double  extended  precision  float 

float  (0:1:79) 

near  pointer 

-1(32) 

far  pointer  (logical  address) 

uint  (0:0:16)  x  uint  (0:0:32)  — >  _L  (48) 

eax,  ebx,  ecx,  edx 

X  (32) 

esi,  edi,  ebp,  esp 

X  (32) 

eip 

int  (0:1:31 ) 

cs,  ds,  ss,  es,  fs,  gs 

-1  (16) 

ax,  bx,  cx,  dx 

X  (16) 

al,  bl,  cl,  dl 

X  (8) 

ah,  bh,  ch,  dh 

J-  (8) 

Table  4:  IA-32  datatypes  and  their  corresponding  expression  in  the  type  system  from  Table  3. 


7  Detector 

The  detector  takes  as  its  inputs  an  annotated  CFG  for 
an  executable  program  procedure  and  a  malicious  code 
automaton.  If  the  malicious  pattern  described  by  the  ma¬ 
licious  code  automaton  is  also  found  in  the  annotated 
CFG,  the  detector  returns  the  sequence  of  instructions 
exhibiting  the  pattern.  The  detector  returns  no  if  the  ma¬ 
licious  pattern  cannot  be  found  in  the  annotated  CFG. 

7.1  The  Malicious-Code  Automaton 

Intuitively,  the  malicious  code  automaton  is  a  gener¬ 
alization  of  the  vanilla  virus,  i.e.,  the  malicious  code  au¬ 
tomaton  also  represents  obfuscated  strains  of  the  virus. 
Formally,  a  malicious  code  automaton  (or  MCA)  A  is  a 
6-tuple  (V,  X,  S,  6,  S0,  F),  where 

•  V  =  {iq  :  Ti, . . . ,  Vk  '■  Tk}  is  a  set  of  typed  variables, 

•  X  =  {ri,...,rn}  is  a  finite  alphabet  of  patterns 
parametrized  by  variables  from  V,  for  1  <  i  <  n, 
P,  -  (Vi,  Oi,  Gi )  where  Vt  C  V, 

•  S'  is  a  finite  set  of  states, 

•  6  :  S  x  X  — >  2s  is  a  transition  function, 

•  So  C  S  is  a  non-empty  set  of  initial  states, 

•  F  C  S  is  a  non-empty  set  of  final  states. 

An  MCA  is  a  generalization  of  an  ordinary  finite-state 
automaton  in  which  the  alphabets  are  a  finite  set  of  pat¬ 
terns  defined  over  a  set  of  typed  variables.  Given  a  bind¬ 
ing  B  for  the  variables  V  =  {  t’i , . . . ,  the  finite-state 
automaton  obtained  by  substituting  B(vf)  for  V{  for  all 
1  <  *  <  k  in  A  is  denoted  by  B(A).  Note  that  B(A)  is 
a  “vanilla”  finite-state  automaton.  We  explain  this  using 


an  example.  Consider  the  MCA  A  shown  in  Figure  10 
with  V  =  {A,  B,C,  D}.  The  automata  obtained  from 
A  corresponding  to  the  bindings  B i  and  B-2  are  shown  in 
Figure  10.  The  uninterpreted  variables  in  the  MCA  were 
introduced  to  handle  obfuscation  transformations  based 
on  register  reassignment.  The  malicious  code  automaton 
corresponding  to  the  code  fragment  shown  in  Figure  6 
(from  the  Chernobyl  virus)  is  depicted  in  Figure  11. 


mov 

esi,  ecx 

mov 

eax,  0d601h 

pop 

edx 

pop 

ecx 

Br  = 

{  [A,  esi], 

[B,  ecx], 

[C,  eax], 

[D,  edx]  } 

mov 

esi. 

eax 

mov 

ebx. 

0d60 lh 

pop 

ecx 

pop 

eax 

£>2  =  {  [A,  esi], 


[B,  eax], 
[C,  ebx], 
[D,  ecx]  } 


Figure  10:  Malicious  code  automaton  for  a  Chernobyl  virus 
code  fragment,  and  instantiations  with  different  register  as¬ 
signments,  shown  with  their  respective  bindings. 


so 


Irrelevant Jump ( ) 


Figure  11:  Malicious  code  automaton  corresponding  to  code  frag¬ 
ment  from  Figure  6. 


7.2  Detector  Operation 

The  detector  takes  as  its  inputs  the  annotated  CFG  /A 
of  a  program  procedure  P  and  a  malicious  code  automa¬ 
ton  MCA  A  =  ( V. ,  £,  S,  S,  So,  F).  Note  that  the  set  of 
patterns  £  is  used  both  to  construct  the  annotated  CFG 
and  as  the  alphabet  of  the  malicious  code  automaton.  In¬ 
tuitively,  the  detector  determines  whether  there  exists  a 
malicious  pattern  that  occurs  in  A  and  Ps.  We  formal¬ 
ize  this  intuitive  notion.  The  annotated  CFG  FV  is  a 
finite-state  automaton  where  nodes  are  states,  edges  rep¬ 
resent  transitions,  the  node  corresponding  to  the  entry 
point  is  the  initial  state,  and  every  node  is  a  final  state. 
Our  detector  determines  whether  the  following  language 
is  empty: 

HPy.)  0  (  |J  L(B(A))\ 

\BeBau  J 

In  the  expression  given  above,  L(P^)  is  the  language 
corresponding  to  the  annotated  CFG  and  Bah  is  the  set 


of  all  bindings  to  the  variables  in  the  set  V.  In  other 
words,  the  detector  determines  whether  there  exists  a 
binding  B  such  that  the  intersection  of  the  languages  FV: 
and  B(A)  is  non-empty. 

Our  detection  algorithm  is  very  similar  to  the  clas¬ 
sic  algorithm  for  determining  whether  the  intersection 
of  two  regular  languages  is  non-empty  [22],  However, 
due  to  the  presence  of  variables,  we  must  perform  unifi¬ 
cation  during  the  algorithm.  Our  algorithm  (Figure  12) 
combines  the  classic  algorithm  for  computing  the  inter¬ 
section  of  two  regular  languages  with  unification.  We 
have  implemented  the  algorithm  as  a  data-flow  analysis. 

•  For  each  node  n  of  the  annotated  CFG  P,\  we  associate 
pre  and  post  lists  and  L^at  respectively.  Each  ele¬ 
ment  of  a  list  is  a  pair  [s,  23],  where  s  is  the  state  of  the 
MCA  A  and  B  is  the  binding  of  variables.  Intuitively,  if 
[s,  B]  £  IA[e,  then  it  is  possible  for  A  with  the  binding 
B  (i.e.  for  B(A))  to  be  in  state  s  just  before  node  n. 

•  Initial  condition:  Initially,  both  lists  associated  with 
all  nodes  except  the  start  node  no  are  empty.  The  pre  list 


associated  with  the  start  node  is  the  list  of  all  pairs  [s,  0], 
where  s  is  an  initial  state  of  the  MCA  A,  and  the  post 
list  associated  with  the  start  node  is  empty. 

•  The  do-until  loop:  The  do-until  loop  updates  the  pre 
and  post  lists  of  all  the  nodes.  At  the  end  of  the  loop,  the 
worklist  WS  contains  the  set  of  nodes  whose  pre  or  post 
information  has  changed.  The  loop  executes  until  the  pre 
and  post  information  associated  with  the  nodes  does  not 
change,  and  a  fixed  point  is  reached.  The  join  operation 
that  computes  L?re  takes  the  list  of  state-binding  pairs 
from  all  of  the  L^ost  sets  for  program  points  preceding 
i  and  copies  them  to  LpAe  only  if  there  are  no  repeated 
states.  In  case  of  repeated  states,  the  conflicting  pairs 
are  merged  into  a  single  pair  only  if  the  bindings  are 
compatible.  If  the  bindings  are  incompatible,  both  pairs 
are  thrown  out. 

•  Diagnostic  feedback:  Suppose  our  algorithm  returns 
a  non-empty  set,  meaning  a  malicious  pattern  is  com¬ 
mon  to  the  annotated  CFG  I\  and  MCA  A.  In  this 
case,  we  return  the  sequence  of  instructions  in  the  ex¬ 
ecutable  corresponding  to  the  malicious  pattern.  This  is 
achieved  by  keeping  an  additional  structure  with  the  al¬ 
gorithm.  Every  time  the  post  list  for  a  node  n  is  updated 
by  taking  a  transition  in  A  (see  the  statement  14  in  Fig¬ 
ure  12),  we  store  the  predecessor  of  the  added  state,  i.e., 
if  T),  BSUB]  is  added  to  Lpost,  then  we  add  an  edge 
from  s  to  S(s,  T)  (along  with  the  binding  Bs  U  B)  in  the 
associated  structure.  Suppose  we  detect  that  U^st  con¬ 
tains  a  state  [s,  Bs],  where  s  is  a  final  state  of  the  MCA 
A.  Then  we  traceback  the  associated  structure  from  s 
until  we  reach  an  initial  state  of  A  (storing  the  instruc¬ 
tions  occurring  along  the  way). 

8  Experimental  Data 

The  three  major  goals  of  our  experiments  were  to 
measure  the  execution  time  of  our  tool  and  find  the 
false  positive  and  negative  rates.  We  constructed  ten 
obfuscated  versions  of  the  four  viruses.  Let  V,j,  (for 
1  <  i  <  1  and  1  <  k  <  10)  denote  the  fc-th  version  of 
the  y-th  virus.  The  obfuscated  versions  were  created  by 
varying  the  obfuscation  parameters,  e.g.,  number  of  nops 
and  inserted  jumps.  For  the  y-th  virus,  V, .  i  denoted  the 
“vanilla”  or  the  unobfuscated  version  of  the  virus.  Let 
All,  M2,  and  M4  be  the  malicious  code  automata 
corresponding  to  the  four  viruses. 

8.1  Testing  Environment 

The  testing  environment  consisted  of  a  Microsoft 
Windows  2000  machine.  The  hardware  configuration 
included  an  AMD  Athlon  1  GHz  processor  and  1  GB 
of  RAM.  We  used  CodeSurfer  version  1 .5  patchlevel  0 
and  IDA  Pro  version  4.1.7.600. 


8.2  Testing  on  Malicious  Code 

We  will  describe  the  testing  with  respect  to  the  first 
virus.  The  testing  for  the  other  viruses  is  analogous. 
First,  we  ran  SAFE  on  the  10  versions  of  the  first 
virus  Viti, . . . ,  Vipo  with  malicious  code  automaton 
Mi.  This  experiment  gave  us  the  false  negative  rate,  i.e., 
the  pattern  corresponding  to  M\  should  be  detected  in 
all  versions  of  the  virus. 


Annotator 

Detector 

avg. 

(std.  dev.) 

avg. 

(std.  dev.) 

Chernobyl 

1.444  s 

(0.497  s) 

0.535  s 

(0.043  s) 

z0mbie-6.b 

4.600  s 

(2.059  s) 

1.149  s 

(0.041  s) 

fOsfOrO 

4.900  s 

(2.844  s) 

0.923  s 

(0.192  s) 

Hare 

9.142  s 

(1.551  s) 

1.604  s 

(0.104  s) 

Table  5:  SAFE  performance  when  checking  obfuscated  viru¬ 
ses  for  false  negatives. 


Next,  we  executed  SAFE  on  the  versions  of  the 
viruses  with  the  malicious  code  automaton  Mj 
(where  i  /  j).  This  helped  us  find  the  false  positive 
rate  of  SAFE. 

In  our  experiments,  we  found  that  SAFE’S  false  pos¬ 
itive  and  negative  rate  were  0.  We  also  measured 
the  execution  times  for  each  run.  Since  IDA  Pro  and 
CodeSurfer  were  not  implemented  by  us,  we  did  not 
measure  the  execution  times  for  these  components.  We 
report  the  average  and  standard  deviation  of  the  execu¬ 
tion  times  in  Tables  5  and  6. 


Annotator 

Detector 

avg. 

(std.  dev.) 

avg. 

(std.  dev.) 

z0mbie-6.b 

3.400  s 

(1.428  s) 

1.400  s 

(0.420  s) 

fOsfOrO 

4.900  s 

(1.136  s) 

0.840  s 

(0.082  s) 

Hare 

1.000  s 

(0.000  s) 

0.220  s 

(0.019  s) 

Table  6:  SAFE  performance  when  checking  obfuscated  viru¬ 
ses  for  false  positives  against  the  Chernobyl/CIH  virus. 


8.3  Testing  on  Benign  Code 

We  considered  a  suite  of  benign  programs  (see  Sec¬ 
tion  8.3.1  for  descriptions).  For  each  benign  program, 
we  executed  SAFE  on  the  malicious  code  automaton 
corresponding  to  the  four  viruses.  Our  detector  reported 
“negative”  in  each  case,  i.e.,  the  false  positive  rate  is  0. 
The  average  and  variance  of  the  execution  times  are  re¬ 
ported  in  Table  7.  As  can  be  seen  from  the  results,  for 
certain  cases  the  execution  times  are  unacceptably  large. 
We  will  address  performance  enhancements  to  SAFE  in 
the  future. 

8.3.1  Descriptions  of  the  Benign  Executables 

•  tiffdither.exe  is  a  command  line  utility  in  the  cygwin 
toolkit  version  1.3.70,  a  UNIX  environment  for  Win¬ 
dows  developed  by  Red  Hat. 


Input: 

A  list  of  patterns  E  =  [  I\ .....  P, },  a  malicious  code  automaton  A  = 

(V,  E,  S,  6.  So,  F),  and  an  annotated  CFG  =<  N,E  >. 

Output:  true  if  the  program  is  likely  infected,  false  otherwise 
MaliciousCodeChecking(E,  A,  Pe) 

0) 

L Pre  <—  {  [s,  0]  s  £  Sq  },  where  no  €  N  is  the  entry  node  of  Pe 

(2) 

foreach  n  £  N  \  {?i0}  do  LPfe  <—  0 

(3) 

foreach  rtgJVdo  <—  0 

(4) 

WS  «-  0 

(5) 

do 

(6) 

WSoid  «-  WS 

(7) 

WS  <—  0 

(8) 

foreach  n  £  N 

//  update  pre  information 

(9) 

if  r  pre  /  I  I  rpost 

/  UmG  Previous  (n) 

GO) 

t  pre  ^ _  1  1  t  post 

^ n  ^Jm^Previous(n)  m 

(11) 

WS  <—  WS  U  {n} 

(12) 

foreach  n  £  N 

//  update  post  information 

(13) 

NewLP°st  <-  0 

(14) 

foreach  [s,  Bs\  £  LPre 

(15) 

foreach  [r,  B\  £  Annotation(n ) 

//  follow  a  transition 

(16) 

A  Compatible(Bs,  B) 

(17) 

add  [  6(s,  r),  Bs  U  B }  to  NewLP°st 

(18) 

if  LP°st  ±  NewLP°3t 

(19) 

Lposi  NewLP°st 

(20) 

WS  4-  WS  U  {n} 

(21) 

until  WS  =  0 

(22) 

return  3  n  £  N  .  3  [s,  Bs\  £  LPfst  .  s  £  F 

Figure  12:  Algorithm  to  check  a  program  model  against  a  malicious  code  specification. 


•  winmine.exe  is  the  Microsoft  Windows  2000  Mine¬ 
sweeper  game,  version  5.0.2135.1. 

•  spyxx.exe  is  a  Microsoft  Visual  Studio  6.0  Spy++  util¬ 
ity,  that  allows  the  querying  of  properties  and  monitoring 
of  messages  of  Windows  applications.  The  executable 
we  tested  was  marked  as  version  6.0.8168.0. 

•  QuickTimePlayer.exe  is  part  of  the  Apple  QuickTime 
media  player,  version  5.0.2.15. 

9  Conclusion  and  Future  Work 

We  presented  a  unique  view  of  malicious  code  detec¬ 
tion  as  a  obfuscation-deobfuscation  game.  We  used  this 
viewpoint  to  explore  obfuscation  attacks  on  commercial 
virus  scanners,  and  found  that  three  popular  virus  scan¬ 
ners  were  susceptible  to  these  attacks.  We  presented  a 
static  analysis  framework  for  detecting  malicious  code 
patterns  in  executables.  Based  upon  our  framework,  we 
have  implemented  SAFE,  a  static  analyzer  for  executa¬ 
bles  that  detects  malicious  patterns  in  executables  and  is 
resilient  to  common  obfuscation  transformations. 

For  future  work,  we  will  investigate  the  use  of  the¬ 
orem  provers  during  the  construction  of  the  annotated 
CFG.  For  instance,  SLAM  [2]  uses  the  theorem  prover 
Simplify  [16]  for  predicate  abstraction  of  C  programs. 
Our  detection  algorithm  is  context  insensitive  and  does 


not  track  the  calling  context  of  the  executable.  We  will 
investigate  the  use  of  the  push-down  systems,  which 
would  make  our  algorithm  context  sensitive.  However, 
the  existing  PDS  formalism  does  not  allow  uninterpreted 
variables,  so  it  will  have  to  be  extended  to  be  used  in  our 
context. 

Availability 

The  SAFE  prototype  remains  in  development  and  we 
are  not  distributing  it  at  this  time.  Please  contact  Mi- 
hai  Christodorescu,  mihai@cs.wisc.edu,  for  further 
updates. 
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Notes 

1Note  that  the  subroutine  address  computation  had  to  be  updated  to 
take  into  account  the  new  nops.  This  is  a  trivial  computation  and  can 
be  implemented  by  adding  the  number  of  inserted  nops  to  the  initial 
offset  hard-coded  in  the  virus-morphing  code. 

2  Most  executable  formats  require  that  the  various  sections  of  the 
executable  file  start  at  certain  aligned  addresses,  to  respect  the  target 
platform’s  idiosyncrasies.  The  extra  space  between  the  end  of  one 
section  and  the  beginning  of  the  next  is  usually  padded  with  nulls. 

3 We  use  one-way  matching  which  is  simpler  than  full  unification. 
Note  that  the  instruction  sequence  does  not  contain  any  variables.  We 
instantiate  variables  in  the  pattern  so  that  they  match  the  corresponding 
terms  in  the  instruction  sequence. 


